
fix: coerce judge score drift #756

Open
schultzjack wants to merge 1 commit into NVIDIA-NeMo:main from schultzjack:codex/569-coerce-judge-scores

Conversation

@schultzjack

@schultzjack schultzjack commented Jun 17, 2026


Summary

  • normalize LLM-judge score values before enum validation in generated judge response models
  • accept numeric/string drift and simple case/whitespace drift when it maps unambiguously to a configured score option
  • keep unmatched or malformed scores on the existing Pydantic validation path

Scope

This addresses the LLM-judge validation path discussed in #569. It intentionally leaves the broader LLM-structured schema coercion path unchanged.

Testing

  • uv run --group dev pytest packages/data-designer-engine/tests/engine/column_generators/utils/test_judge_score_factory.py -q
  • uv run --group dev pytest packages/data-designer-engine/tests/engine/column_generators/utils/test_judge_score_factory.py packages/data-designer-engine/tests/engine/column_generators/utils/test_prompt_renderer.py packages/data-designer-engine/tests/engine/column_generators/generators/test_llm_completion_generators.py packages/data-designer-engine/tests/engine/models/recipes/test_response_recipes.py -q
  • make check-engine
  • make test-engine

Fixes #569

Signed-off-by: schultzjack <schultzjack@users.noreply.github.com>
@schultzjack schultzjack requested a review from a team as a code owner June 17, 2026 00:31
@github-actions

github-actions Bot commented Jun 17, 2026

Contributor

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@schultzjack

Author

I have read the DCO document and I hereby sign the DCO.

@schultzjack

Author

recheck

@greptile-apps

greptile-apps Bot commented Jun 17, 2026

Contributor

Greptile Summary

This PR fixes LLM-judge score drift by adding a mode="before" Pydantic model validator on BaseJudgeResponse that normalises incoming score values (numeric↔string, case, whitespace) before they reach enum validation, falling back to the standard Pydantic path when the coercion is ambiguous or the value is malformed.

  • _normalize_score_value converts values to a stripped, casefolded string (integer floats become their integer string), enabling reliable equality comparison across types.
  • _coerce_score_value performs an exact-match pass first (with a bool-guard to prevent True/False from silently matching 1/0), then falls back to normalised matching only when exactly one enum member maps to the same normalised form.
  • Four tests covering the primary drift scenarios, nested-model coercion, and unhashable-value fallthrough are added.
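The two helpers described above can be sketched roughly as follows. This is a reconstruction from the summary and flowchart only, not the code in the PR: the function names drop the leading underscore, and the signatures, the `Enum` classes, and the exact form of the bool guard are assumptions made for illustration.

```python
from enum import Enum
from typing import Any


def normalize_score_value(value: Any) -> str:
    """Reduce a score to a stripped, casefolded string for comparison."""
    # Integer-valued floats collapse to their integer form, so 1.0 -> "1".
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    return str(value).strip().casefold()


def coerce_score_value(value: Any, enum_type: type[Enum]) -> Any:
    """Return the matching member value, or the input unchanged."""
    # Pass 1: exact match. bool is a subclass of int, so True == 1; comparing
    # "is it a bool?" on both sides keeps True/False from matching 1/0.
    for member in enum_type:
        if isinstance(value, bool) != isinstance(member.value, bool):
            continue
        if member.value == value:
            return value

    # Pass 2: normalized match, accepted only when exactly one member fits.
    normalized = normalize_score_value(value)
    matches = [m for m in enum_type if normalize_score_value(m.value) == normalized]
    if len(matches) == 1:
        return matches[0].value

    # Ambiguous or unrecognised: leave the value for Pydantic to reject.
    return value


# Hypothetical score enums, standing in for the generated ones.
class StringScores(Enum):
    LOW = "1"
    HIGH = "2"


class IntScores(Enum):
    ONE = 1
    TWO = 2


class CaseSensitiveScores(Enum):
    LOWER = "a"
    UPPER = "A"


print(coerce_score_value(1.0, StringScores))          # float drift to a string option
print(coerce_score_value(" 2 ", IntScores))           # string drift to an int option
print(coerce_score_value(99, IntScores))              # out of range, returned as-is
print(coerce_score_value(" a ", CaseSensitiveScores)) # ambiguous, returned as-is
```

Returning the input unchanged when nothing matches is what keeps malformed scores on the normal validation path: the coercion step never raises on its own.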

Confidence Score: 5/5

The change is self-contained to the judge score coercion path and introduces no mutations to unrelated model behaviour.

The coercion logic is narrowly scoped: it only fires when the field annotation is a concrete Enum subclass, the input is a dict, and exactly one enum member matches after normalisation. The bool-guard in the exact-match phase correctly prevents True/False from silently collapsing onto integer members. Unrecognised or ambiguous values pass through to Pydantic unchanged, preserving existing validation behaviour. The new tests cover the three key drift categories plus the unhashable-value fallthrough path.

No files require special attention.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/column_generators/utils/judge_score_factory.py Adds _normalize_score_value, _coerce_score_value, and a model_validator(mode='before') on BaseJudgeResponse to coerce LLM-returned score drift (numeric/string/case/whitespace) before Pydantic enum validation; unmatched values fall through to Pydantic unchanged.
packages/data-designer-engine/tests/engine/column_generators/utils/test_judge_score_factory.py Adds four new tests covering int→string, string→int, and case/whitespace coercion via parametrize; nested structured-output coercion; and unhashable-score fallthrough to Pydantic ValidationError.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["LLM returns score value"] --> B["coerce_score model_validator mode=before"]
    B --> C{"data is dict with score?"}
    C -- No --> Z["Pass through unchanged"]
    C -- Yes --> D{"score field is Enum type?"}
    D -- No --> Z
    D -- Yes --> E["_coerce_score_value(value, enum_type)"]
    E --> F{"Exact match with bool guard?"}
    F -- Yes --> G["Return original value"]
    F -- No --> H["_normalize_score_value\nstrip + casefold, float-to-int"]
    H --> I{"Exactly 1 member matches normalized value?"}
    I -- Yes --> J["Return matched member.value"]
    I -- No --> K["Return original value\nambiguous or unrecognised"]
    G --> L["Pydantic validates against enum"]
    J --> L
    K --> L
    L --> M{"Valid enum value?"}
    M -- Yes --> N["Model instance stored value"]
    M -- No --> O["ValidationError raised"]

Reviews (1): Last reviewed commit: "fix: coerce judge score drift"

@github-actions

Contributor

Stale PR reminder

This PR has had failing checks for 7 days without activity.

Failing checks: check

Please push an update or leave a comment if you're still working on this.
Otherwise, this PR will be automatically closed in 7 days.

To prevent auto-close, add the keep-open label.

@andreatgretel

Contributor

Thanks for the contribution, this is a useful bit of tolerance around judge outputs. I reviewed the score coercion path and the generated Pydantic models. The implementation is nicely scoped and I don't see major blockers, but I'd like a small polish pass before merge.

A couple of test cases would make the new fallback behavior clearer:

  • Add a case for float drift into string score options, e.g. options {"1": "Low quality"} with model output 1.0 should coerce to "1".
  • Add a case for an out-of-range scalar like 99 to confirm it still falls through to Pydantic validation rather than being coerced.

Also, please add a short comment above the bool guard in _coerce_score_value(). It's there because bool is a subclass of int in Python, so True == 1 and False == 0; that context will help keep the guard from looking accidental.
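For context, here is a minimal illustration of why the guard matters. The `Score` enum and both helper functions are hypothetical and only demonstrate the Python behaviour; they are not the PR's code.

```python
from enum import Enum
from typing import Any


class Score(Enum):
    ZERO = 0
    ONE = 1


def exact_match_naive(value: Any, enum_type: type[Enum]) -> bool:
    # Plain equality: True == 1 and False == 0, so bools match int members.
    return any(member.value == value for member in enum_type)


def exact_match_guarded(value: Any, enum_type: type[Enum]) -> bool:
    # bool is a subclass of int, so require both sides to agree on being a
    # bool before trusting equality.
    return any(
        isinstance(value, bool) == isinstance(member.value, bool)
        and member.value == value
        for member in enum_type
    )


print(issubclass(bool, int))             # True
print(exact_match_naive(True, Score))    # True: the bool slips through
print(exact_match_guarded(True, Score))  # False: the guard rejects it
print(exact_match_guarded(1, Score))     # True: real ints still match
```

Without the guard, a judge that returns `true` instead of a numeric score would be treated as an exact match for the `1` option.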

Focused tests and smoke checks passed locally. Once those small coverage/readability items are in, this looks good to merge from my side.


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Schema validation rejects small-model output before coercion can normalize it

2 participants